EDA - Home Credit¶
Problem Statement:¶
Conduct a comprehensive analysis of a dataset containing various financial and demographic attributes of loan applicants. Our goal is to gain insights into the factors influencing loan default rates and to develop strategies to mitigate risks associated with lending.
Objectives:¶
Task 1 - Importing necessary Modules:
- Import the modules necessary for Data Manipulation and Visualization.
Task 2 - Reading dataset:
- Read the dataset containing loan applicant information.
Task 3 - Exploring the Dataset:
- Understand the Structure and various datatypes of the attributes within the dataset.
Task 4 - Missing value analysis:
- Identify and analyze missing values in the dataset.
Task 5 - Analysing categorical and numerical columns:
- Analyze categorical and numerical columns to understand the statistical properties and relationships within the dataset.
Task 6 - Univariate Analysis:
- Conduct univariate analysis to explore the distribution and characteristics of individual variables.
Task 7 - Outliers:
- Identify and analyze outliers within the dataset to understand their impact on the analysis.
Task 8 - Merging Datasets:
- Identify and merge different Datasets for further analysis.
Task 9 - Bivariate analysis:
- Conduct bivariate analysis to explore relationships between different variables and their impact on loan default rates.
Task 1 - Importing Necessary Modules¶
Description¶
In this task, you will import all the necessary modules and packages required for performing various operations in the project.
import pandas as pd # For data manipulation and analysis
import numpy as np # For numerical computation
import matplotlib.pyplot as plt # For plotting
import seaborn as sns # For advanced statistical visualizations
sns.set(color_codes=True) # Set seaborn color palette to default
import warnings
# Filter out warnings to ignore them
warnings.filterwarnings('ignore')
Task 2 - Reading the Dataset¶
# Read the CSV file 'application_data.csv' into a DataFrame
application_train = pd.read_csv('application_data.csv')
Task 3 - Exploring the Data¶
# Display the first few rows of the DataFrame to inspect its structure and contents
application_train.head()
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 122 columns
# Get the dimensions of the DataFrame (number of rows and columns)
application_train.shape
(211607, 122)
# Get concise summary information about the DataFrame, including column data types and non-null counts
application_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 211607 entries, 0 to 211606
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(101), int64(5), object(16)
memory usage: 197.0+ MB
application_train.iloc[:,:100].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 211607 entries, 0 to 211606
Data columns (total 100 columns):
 #   Column                        Non-Null Count   Dtype
---  ------                        --------------   -----
 0   SK_ID_CURR                    211607 non-null  int64
 1   TARGET                        211607 non-null  int64
 2   NAME_CONTRACT_TYPE            211607 non-null  object
 3   CODE_GENDER                   211607 non-null  object
 4   FLAG_OWN_CAR                  211607 non-null  object
 5   FLAG_OWN_REALTY               211607 non-null  object
 6   CNT_CHILDREN                  211607 non-null  int64
 7   AMT_INCOME_TOTAL              211607 non-null  float64
 8   AMT_CREDIT                    211607 non-null  float64
 9   AMT_ANNUITY                   211598 non-null  float64
 10  AMT_GOODS_PRICE               211434 non-null  float64
 11  NAME_TYPE_SUITE               210739 non-null  object
 12  NAME_INCOME_TYPE              211607 non-null  object
 13  NAME_EDUCATION_TYPE           211607 non-null  object
 14  NAME_FAMILY_STATUS            211607 non-null  object
 15  NAME_HOUSING_TYPE             211607 non-null  object
 16  REGION_POPULATION_RELATIVE    211607 non-null  float64
 17  DAYS_BIRTH                    211607 non-null  int64
 18  DAYS_EMPLOYED                 211607 non-null  int64
 19  DAYS_REGISTRATION             211607 non-null  float64
 20  DAYS_ID_PUBLISH               211606 non-null  float64
 21  OWN_CAR_AGE                   71891 non-null   float64
 22  FLAG_MOBIL                    211606 non-null  float64
 23  FLAG_EMP_PHONE                211606 non-null  float64
 24  FLAG_WORK_PHONE               211606 non-null  float64
 25  FLAG_CONT_MOBILE              211606 non-null  float64
 26  FLAG_PHONE                    211606 non-null  float64
 27  FLAG_EMAIL                    211606 non-null  float64
 28  OCCUPATION_TYPE               145141 non-null  object
 29  CNT_FAM_MEMBERS               211604 non-null  float64
 30  REGION_RATING_CLIENT          211606 non-null  float64
 31  REGION_RATING_CLIENT_W_CITY   211606 non-null  float64
 32  WEEKDAY_APPR_PROCESS_START    211606 non-null  object
 33  HOUR_APPR_PROCESS_START       211606 non-null  float64
 34  REG_REGION_NOT_LIVE_REGION    211606 non-null  float64
 35  REG_REGION_NOT_WORK_REGION    211606 non-null  float64
 36  LIVE_REGION_NOT_WORK_REGION   211606 non-null  float64
 37  REG_CITY_NOT_LIVE_CITY        211606 non-null  float64
 38  REG_CITY_NOT_WORK_CITY        211606 non-null  float64
 39  LIVE_CITY_NOT_WORK_CITY       211606 non-null  float64
 40  ORGANIZATION_TYPE             211606 non-null  object
 41  EXT_SOURCE_1                  92257 non-null   float64
 42  EXT_SOURCE_2                  211138 non-null  float64
 43  EXT_SOURCE_3                  169578 non-null  float64
 44  APARTMENTS_AVG                104102 non-null  float64
 45  BASEMENTAREA_AVG              87640 non-null   float64
 46  YEARS_BEGINEXPLUATATION_AVG   108328 non-null  float64
 47  YEARS_BUILD_AVG               70860 non-null   float64
 48  COMMONAREA_AVG                63648 non-null   float64
 49  ELEVATORS_AVG                 98733 non-null   float64
 50  ENTRANCES_AVG                 104974 non-null  float64
 51  FLOORSMAX_AVG                 106237 non-null  float64
 52  FLOORSMIN_AVG                 67976 non-null   float64
 53  LANDAREA_AVG                  85853 non-null   float64
 54  LIVINGAPARTMENTS_AVG          66902 non-null   float64
 55  LIVINGAREA_AVG                105325 non-null  float64
 56  NONLIVINGAPARTMENTS_AVG       64662 non-null   float64
 57  NONLIVINGAREA_AVG             94778 non-null   float64
 58  APARTMENTS_MODE               104102 non-null  float64
 59  BASEMENTAREA_MODE             87640 non-null   float64
 60  YEARS_BEGINEXPLUATATION_MODE  108328 non-null  float64
 61  YEARS_BUILD_MODE              70860 non-null   float64
 62  COMMONAREA_MODE               63648 non-null   float64
 63  ELEVATORS_MODE                98733 non-null   float64
 64  ENTRANCES_MODE                104974 non-null  float64
 65  FLOORSMAX_MODE                106237 non-null  float64
 66  FLOORSMIN_MODE                67976 non-null   float64
 67  LANDAREA_MODE                 85853 non-null   float64
 68  LIVINGAPARTMENTS_MODE         66902 non-null   float64
 69  LIVINGAREA_MODE               105325 non-null  float64
 70  NONLIVINGAPARTMENTS_MODE      64662 non-null   float64
 71  NONLIVINGAREA_MODE            94778 non-null   float64
 72  APARTMENTS_MEDI               104102 non-null  float64
 73  BASEMENTAREA_MEDI             87640 non-null   float64
 74  YEARS_BEGINEXPLUATATION_MEDI  108328 non-null  float64
 75  YEARS_BUILD_MEDI              70860 non-null   float64
 76  COMMONAREA_MEDI               63648 non-null   float64
 77  ELEVATORS_MEDI                98733 non-null   float64
 78  ENTRANCES_MEDI                104974 non-null  float64
 79  FLOORSMAX_MEDI                106237 non-null  float64
 80  FLOORSMIN_MEDI                67976 non-null   float64
 81  LANDAREA_MEDI                 85853 non-null   float64
 82  LIVINGAPARTMENTS_MEDI         66902 non-null   float64
 83  LIVINGAREA_MEDI               105325 non-null  float64
 84  NONLIVINGAPARTMENTS_MEDI      64662 non-null   float64
 85  NONLIVINGAREA_MEDI            94778 non-null   float64
 86  FONDKAPREMONT_MODE            66867 non-null   object
 87  HOUSETYPE_MODE                105311 non-null  object
 88  TOTALAREA_MODE                109399 non-null  float64
 89  WALLSMATERIAL_MODE            103942 non-null  object
 90  EMERGENCYSTATE_MODE           111260 non-null  object
 91  OBS_30_CNT_SOCIAL_CIRCLE      210910 non-null  float64
 92  DEF_30_CNT_SOCIAL_CIRCLE      210910 non-null  float64
 93  OBS_60_CNT_SOCIAL_CIRCLE      210910 non-null  float64
 94  DEF_60_CNT_SOCIAL_CIRCLE      210910 non-null  float64
 95  DAYS_LAST_PHONE_CHANGE        211605 non-null  float64
 96  FLAG_DOCUMENT_2               211606 non-null  float64
 97  FLAG_DOCUMENT_3               211606 non-null  float64
 98  FLAG_DOCUMENT_4               211606 non-null  float64
 99  FLAG_DOCUMENT_5               211606 non-null  float64
dtypes: float64(79), int64(5), object(16)
memory usage: 161.4+ MB
application_train.iloc[:,100:122].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 211607 entries, 0 to 211606
Data columns (total 22 columns):
 #   Column                      Non-Null Count   Dtype
---  ------                      --------------   -----
 0   FLAG_DOCUMENT_6             211606 non-null  float64
 1   FLAG_DOCUMENT_7             211606 non-null  float64
 2   FLAG_DOCUMENT_8             211606 non-null  float64
 3   FLAG_DOCUMENT_9             211606 non-null  float64
 4   FLAG_DOCUMENT_10            211606 non-null  float64
 5   FLAG_DOCUMENT_11            211606 non-null  float64
 6   FLAG_DOCUMENT_12            211606 non-null  float64
 7   FLAG_DOCUMENT_13            211606 non-null  float64
 8   FLAG_DOCUMENT_14            211606 non-null  float64
 9   FLAG_DOCUMENT_15            211606 non-null  float64
 10  FLAG_DOCUMENT_16            211606 non-null  float64
 11  FLAG_DOCUMENT_17            211606 non-null  float64
 12  FLAG_DOCUMENT_18            211606 non-null  float64
 13  FLAG_DOCUMENT_19            211606 non-null  float64
 14  FLAG_DOCUMENT_20            211606 non-null  float64
 15  FLAG_DOCUMENT_21            211606 non-null  float64
 16  AMT_REQ_CREDIT_BUREAU_HOUR  183034 non-null  float64
 17  AMT_REQ_CREDIT_BUREAU_DAY   183034 non-null  float64
 18  AMT_REQ_CREDIT_BUREAU_WEEK  183034 non-null  float64
 19  AMT_REQ_CREDIT_BUREAU_MON   183034 non-null  float64
 20  AMT_REQ_CREDIT_BUREAU_QRT   183034 non-null  float64
 21  AMT_REQ_CREDIT_BUREAU_YEAR  183034 non-null  float64
dtypes: float64(22)
memory usage: 35.5 MB
Task 4 - Missing Values Analysis¶
Calculating the Percentage of Missing values¶
# Calculate the proportion of non-missing values for each column
train_missing = application_train.count() / len(application_train)
# Convert the proportions to percentages and calculate the percentage of missing values for each column
train_missing = (1 - train_missing) * 100
# Sort the missing percentages in descending order and display the top 60 columns
train_missing.sort_values(ascending=False).head(60)
COMMONAREA_AVG                  69.921600
COMMONAREA_MODE                 69.921600
COMMONAREA_MEDI                 69.921600
NONLIVINGAPARTMENTS_MEDI        69.442410
NONLIVINGAPARTMENTS_MODE        69.442410
NONLIVINGAPARTMENTS_AVG         69.442410
FONDKAPREMONT_MODE              68.400384
LIVINGAPARTMENTS_AVG            68.383844
LIVINGAPARTMENTS_MEDI           68.383844
LIVINGAPARTMENTS_MODE           68.383844
FLOORSMIN_MODE                  67.876299
FLOORSMIN_AVG                   67.876299
FLOORSMIN_MEDI                  67.876299
YEARS_BUILD_AVG                 66.513395
YEARS_BUILD_MODE                66.513395
YEARS_BUILD_MEDI                66.513395
OWN_CAR_AGE                     66.026171
LANDAREA_MEDI                   59.428091
LANDAREA_AVG                    59.428091
LANDAREA_MODE                   59.428091
BASEMENTAREA_MODE               58.583601
BASEMENTAREA_MEDI               58.583601
BASEMENTAREA_AVG                58.583601
EXT_SOURCE_1                    56.401726
NONLIVINGAREA_MODE              55.210366
NONLIVINGAREA_AVG               55.210366
NONLIVINGAREA_MEDI              55.210366
ELEVATORS_AVG                   53.341336
ELEVATORS_MEDI                  53.341336
ELEVATORS_MODE                  53.341336
WALLSMATERIAL_MODE              50.879697
APARTMENTS_AVG                  50.804085
APARTMENTS_MEDI                 50.804085
APARTMENTS_MODE                 50.804085
ENTRANCES_MODE                  50.392000
ENTRANCES_MEDI                  50.392000
ENTRANCES_AVG                   50.392000
HOUSETYPE_MODE                  50.232743
LIVINGAREA_MEDI                 50.226127
LIVINGAREA_MODE                 50.226127
LIVINGAREA_AVG                  50.226127
FLOORSMAX_MODE                  49.795139
FLOORSMAX_AVG                   49.795139
FLOORSMAX_MEDI                  49.795139
YEARS_BEGINEXPLUATATION_MODE    48.806987
YEARS_BEGINEXPLUATATION_MEDI    48.806987
YEARS_BEGINEXPLUATATION_AVG     48.806987
TOTALAREA_MODE                  48.300860
EMERGENCYSTATE_MODE             47.421399
OCCUPATION_TYPE                 31.410114
EXT_SOURCE_3                    19.861819
AMT_REQ_CREDIT_BUREAU_HOUR      13.502861
AMT_REQ_CREDIT_BUREAU_WEEK      13.502861
AMT_REQ_CREDIT_BUREAU_MON       13.502861
AMT_REQ_CREDIT_BUREAU_YEAR      13.502861
AMT_REQ_CREDIT_BUREAU_DAY       13.502861
AMT_REQ_CREDIT_BUREAU_QRT       13.502861
NAME_TYPE_SUITE                  0.410194
DEF_60_CNT_SOCIAL_CIRCLE         0.329384
OBS_60_CNT_SOCIAL_CIRCLE         0.329384
dtype: float64
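As a side note, the same missing-value percentages can be computed more directly with `isna().mean()`. A minimal sketch on a small synthetic frame (hypothetical values, standing in for `application_train`) showing the two approaches agree:

```python
import numpy as np
import pandas as pd

# Small synthetic frame standing in for application_train (hypothetical values).
df = pd.DataFrame({
    'A': [1.0, np.nan, 3.0, np.nan],
    'B': [1.0, 2.0, 3.0, 4.0],
})

# Two equivalent ways to get the percentage of missing values per column.
via_count = (1 - df.count() / len(df)) * 100
via_isna = df.isna().mean() * 100

print(via_isna.sort_values(ascending=False))
```

Both produce the same Series; `isna().mean()` simply skips the intermediate count step.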
Missing value treatment¶
As you can observe, a lot of columns have missing values. Some columns are missing around 50% or more of their values, while the others are missing far less. The columns with around 50% or more missing values are mostly the AVG, MODE, and MEDI summary statistics of the housing attributes, so there is no sensible way to impute them. We will therefore drop these columns and carry out the analysis on the remaining ones. Let's analyse the other columns.
Removing the Columns with more than 50% Missing Values¶
# Filter the DataFrame 'application_train' to include only columns with less than 50% missing values
train = application_train.loc[:, train_missing < 50]
# Calculate the proportion of non-missing values for each column
train_missing_2 = train.count() / len(train)
# Convert the proportions to percentages and calculate the percentage of missing values for each column
train_missing_2 = (1 - train_missing_2) * 100
# Sort the missing percentages in descending order
train_missing_2[train_missing_2 > 0].sort_values(ascending=False)
FLOORSMAX_MEDI 49.795139
FLOORSMAX_AVG 49.795139
FLOORSMAX_MODE 49.795139
YEARS_BEGINEXPLUATATION_AVG 48.806987
YEARS_BEGINEXPLUATATION_MEDI 48.806987
...
FLAG_DOCUMENT_16 0.000473
FLAG_DOCUMENT_15 0.000473
FLAG_DOCUMENT_14 0.000473
FLAG_DOCUMENT_20 0.000473
FLAG_DOCUMENT_21 0.000473
Length: 64, dtype: float64
Let's analyse these columns one by one.
Column: OCCUPATION_TYPE¶
The OCCUPATION_TYPE column has a large amount of missing data, about 31%. Although this might suggest removing the column, on closer inspection it carries an important piece of information about the applicant. So we've chosen to keep the column without treating its missing values.
# Display the first few rows of the 'OCCUPATION_TYPE' column in the DataFrame 'train'
train['OCCUPATION_TYPE'].head()
0      Laborers
1    Core staff
2      Laborers
3      Laborers
4    Core staff
Name: OCCUPATION_TYPE, dtype: object
sns.countplot(x= 'OCCUPATION_TYPE', data = train)
plt.xticks(rotation=90)
(x-axis tick labels, rotated 90°: Laborers, Core staff, Accountants, Managers, Drivers, Sales staff, Cleaning staff, Cooking staff, Private service staff, Medicine staff, Security staff, High skill tech staff, Waiters/barmen staff, Low-skill Laborers, Realty agents, Secretaries, IT staff, HR staff)
This column exhibits a significant proportion of missing values, amounting to 31%. Given that it is a categorical column, the only viable substitution for missing data is with the mode value. However, substituting all these missing values with the "Laborers" category wouldn't be appropriate. Therefore, we've decided not to perform any missing value treatment on this column and leave it as is.
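If a treatment were ever needed, one common alternative for a categorical column like this is to keep the missingness visible as its own category rather than imputing the mode. A hedged sketch on hypothetical values (the 'Unknown' label is an assumption, not part of the original analysis):

```python
import pandas as pd

# Hypothetical slice of the OCCUPATION_TYPE column with missing entries.
occ = pd.Series(['Laborers', None, 'Core staff', None, 'Laborers'],
                name='OCCUPATION_TYPE')

# Keep missingness visible by treating it as its own category
# instead of imputing with the mode.
occ_filled = occ.fillna('Unknown')
print(occ_filled.value_counts())
```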
Column: EXT_SOURCE_3¶
# Display the first 10 rows of the 'EXT_SOURCE_3' column in the DataFrame 'train'
train['EXT_SOURCE_3'].head(10)
0    0.139376
1         NaN
2    0.729567
3         NaN
4         NaN
5    0.621226
6    0.492060
7    0.540654
8    0.751724
9         NaN
Name: EXT_SOURCE_3, dtype: float64
# Plotting the distribution
train['EXT_SOURCE_3'].plot.hist()
<Axes: ylabel='Frequency'>
# Calculate the mean of the 'EXT_SOURCE_3' column in the DataFrame 'train'
train['EXT_SOURCE_3'].mean()
np.float64(0.5107185957866892)
# Calculate the mode (most frequent value) of the 'EXT_SOURCE_3' column in the DataFrame 'train'
train['EXT_SOURCE_3'].mode()
0    0.7463
Name: EXT_SOURCE_3, dtype: float64
The data follows a roughly normal but skewed distribution, which is why the mean (~0.51) and the mode (0.7463) differ noticeably.
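Given the skew, the median is generally a safer fill value than the mean, since the mean is pulled toward the tail. A small sketch of median imputation on hypothetical values resembling EXT_SOURCE_3:

```python
import numpy as np
import pandas as pd

# Hypothetical skewed sample standing in for EXT_SOURCE_3.
s = pd.Series([0.14, np.nan, 0.73, np.nan, 0.62, 0.49, 0.54, 0.75])

# For a skewed distribution the median resists the pull of the tail,
# so it is a safer fill value than the mean.
filled = s.fillna(s.median())
print(filled.isna().sum())
```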
Column: AMT_REQ_CREDIT_BUREAU_QRT¶
# Display the first few rows of the 'AMT_REQ_CREDIT_BUREAU_QRT' column in the DataFrame 'train'
train['AMT_REQ_CREDIT_BUREAU_QRT'].head()
0    0.0
1    0.0
2    0.0
3    NaN
4    0.0
Name: AMT_REQ_CREDIT_BUREAU_QRT, dtype: float64
# Calculate the number of unique values in the 'AMT_REQ_CREDIT_BUREAU_QRT' column
num_unique_values = len(train.AMT_REQ_CREDIT_BUREAU_QRT.unique())
# Plot a histogram of the 'AMT_REQ_CREDIT_BUREAU_QRT' column with bins equal to the number of unique values
train['AMT_REQ_CREDIT_BUREAU_QRT'].plot.hist(bins=num_unique_values)
<Axes: ylabel='Frequency'>
# Count the occurrences of each unique value in the 'AMT_REQ_CREDIT_BUREAU_QRT' column
train.AMT_REQ_CREDIT_BUREAU_QRT.value_counts()
AMT_REQ_CREDIT_BUREAU_QRT
0.0    148192
1.0     23367
2.0      9889
3.0      1190
4.0       316
5.0        46
6.0        23
7.0         6
8.0         5
Name: count, dtype: int64
# Calculate the mean (average) of the values in the 'AMT_REQ_CREDIT_BUREAU_QRT' column
train['AMT_REQ_CREDIT_BUREAU_QRT'].mean()
np.float64(0.2645901854300294)
# Calculate the mode (most frequent value) of the 'AMT_REQ_CREDIT_BUREAU_QRT' column
train['AMT_REQ_CREDIT_BUREAU_QRT'].mode()
0    0.0
Name: AMT_REQ_CREDIT_BUREAU_QRT, dtype: float64
Since the column takes only small discrete counts (0, 1, 2, and so on), we cannot replace missing values with the mean, which is a fraction (about 0.26) that never occurs in the data. If we impute, we will replace them with the mode (0.0).
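The mode imputation described above can be sketched on hypothetical values shaped like this column (note that `Series.mode()` returns a Series because ties are possible, so we take the first entry):

```python
import numpy as np
import pandas as pd

# Hypothetical sample of a discrete count column like AMT_REQ_CREDIT_BUREAU_QRT.
qrt = pd.Series([0.0, 0.0, 1.0, np.nan, 0.0, 2.0, np.nan])

# Series.mode() returns a Series (ties are possible); take the first value.
mode_value = qrt.mode()[0]
qrt_filled = qrt.fillna(mode_value)
print(qrt_filled.tolist())
```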
Column: AMT_REQ_CREDIT_BUREAU_YEAR¶
# Display the first 10 rows of the 'AMT_REQ_CREDIT_BUREAU_YEAR' column in the DataFrame 'train'
train['AMT_REQ_CREDIT_BUREAU_YEAR'].head(10)
0    1.0
1    0.0
2    0.0
3    NaN
4    0.0
5    1.0
6    2.0
7    0.0
8    1.0
9    NaN
Name: AMT_REQ_CREDIT_BUREAU_YEAR, dtype: float64
# Calculate the number of unique values in the 'AMT_REQ_CREDIT_BUREAU_YEAR' column
num_unique_values = len(train.AMT_REQ_CREDIT_BUREAU_YEAR.unique())
# Plot a histogram of the 'AMT_REQ_CREDIT_BUREAU_YEAR' column with bins equal to the number of unique values
train['AMT_REQ_CREDIT_BUREAU_YEAR'].plot.hist(bins=num_unique_values)
<Axes: ylabel='Frequency'>
As you can see, the values are only integers (0, 1, 2, and so on). If we impute, we will replace missing values with the mode rather than the mean, because the mean (about 1.89) is a decimal value that never occurs in the data.
# Calculate the mean (average) of the values in the 'AMT_REQ_CREDIT_BUREAU_YEAR' column
train['AMT_REQ_CREDIT_BUREAU_YEAR'].mean()
np.float64(1.8944840849241125)
# Calculate the mode (most frequent value) of the 'AMT_REQ_CREDIT_BUREAU_YEAR' column
train['AMT_REQ_CREDIT_BUREAU_YEAR'].mode()
0    0.0
Name: AMT_REQ_CREDIT_BUREAU_YEAR, dtype: float64
Column: AMT_REQ_CREDIT_BUREAU_WEEK¶
# Display the first 10 rows of the 'AMT_REQ_CREDIT_BUREAU_WEEK' column in the DataFrame 'train'
train['AMT_REQ_CREDIT_BUREAU_WEEK'].head(10)
0    0.0
1    0.0
2    0.0
3    NaN
4    0.0
5    0.0
6    0.0
7    0.0
8    0.0
9    NaN
Name: AMT_REQ_CREDIT_BUREAU_WEEK, dtype: float64
# Calculate the number of unique values in the 'AMT_REQ_CREDIT_BUREAU_WEEK' column
num_unique_values = len(train.AMT_REQ_CREDIT_BUREAU_WEEK.unique())
# Plot a histogram of the 'AMT_REQ_CREDIT_BUREAU_WEEK' column with bins equal to the number of unique values
train['AMT_REQ_CREDIT_BUREAU_WEEK'].plot.hist(bins=num_unique_values)
<Axes: ylabel='Frequency'>
As you can see, the values are mostly 0, so this column behaves almost like a constant with little variation. We can ignore it for the analysis.
Column: AMT_REQ_CREDIT_BUREAU_MON¶
# Display the first 10 rows of the 'AMT_REQ_CREDIT_BUREAU_MON' column in the DataFrame 'train'
train['AMT_REQ_CREDIT_BUREAU_MON'].head(10)
0    0.0
1    0.0
2    0.0
3    NaN
4    0.0
5    0.0
6    1.0
7    0.0
8    0.0
9    NaN
Name: AMT_REQ_CREDIT_BUREAU_MON, dtype: float64
# Calculate the number of unique values in the 'AMT_REQ_CREDIT_BUREAU_MON' column
num_unique_values = len(train.AMT_REQ_CREDIT_BUREAU_MON.unique())
# Plot a histogram of the 'AMT_REQ_CREDIT_BUREAU_MON' column with bins equal to the number of unique values
train['AMT_REQ_CREDIT_BUREAU_MON'].plot.hist(bins=num_unique_values)
<Axes: ylabel='Frequency'>
Again, the same observation: the values are mostly 0, so the column behaves almost like a constant and can be ignored.
Column: AMT_REQ_CREDIT_BUREAU_DAY¶
# Display the first 10 rows of the 'AMT_REQ_CREDIT_BUREAU_DAY' column in the DataFrame 'train'
train['AMT_REQ_CREDIT_BUREAU_DAY'].head(10)
0    0.0
1    0.0
2    0.0
3    NaN
4    0.0
5    0.0
6    0.0
7    0.0
8    0.0
9    NaN
Name: AMT_REQ_CREDIT_BUREAU_DAY, dtype: float64
# Calculate the number of unique values in the 'AMT_REQ_CREDIT_BUREAU_DAY' column
num_unique_values = len(train.AMT_REQ_CREDIT_BUREAU_DAY.unique())
# Plot a histogram of the 'AMT_REQ_CREDIT_BUREAU_DAY' column with bins equal to the number of unique values
train['AMT_REQ_CREDIT_BUREAU_DAY'].plot.hist(bins=num_unique_values)
<Axes: ylabel='Frequency'>
# Count the occurrences of each unique value in the 'AMT_REQ_CREDIT_BUREAU_DAY' column
train.AMT_REQ_CREDIT_BUREAU_DAY.value_counts()
AMT_REQ_CREDIT_BUREAU_DAY
0.0    181983
1.0       908
2.0        75
3.0        29
4.0        23
5.0         7
6.0         6
9.0         2
8.0         1
Name: count, dtype: int64
As you can see, almost all the values are 0 (181,983 of the non-null entries), so we can ignore this column.
Column: AMT_REQ_CREDIT_BUREAU_HOUR¶
# Display the first 10 rows of the 'AMT_REQ_CREDIT_BUREAU_HOUR' column in the DataFrame 'train'
train['AMT_REQ_CREDIT_BUREAU_HOUR'].head(10)
0    0.0
1    0.0
2    0.0
3    NaN
4    0.0
5    0.0
6    0.0
7    0.0
8    0.0
9    NaN
Name: AMT_REQ_CREDIT_BUREAU_HOUR, dtype: float64
# Calculate the number of unique values in the 'AMT_REQ_CREDIT_BUREAU_HOUR' column
num_unique_values = len(train.AMT_REQ_CREDIT_BUREAU_HOUR.unique())
# Plot a histogram of the 'AMT_REQ_CREDIT_BUREAU_HOUR' column with bins equal to the number of unique values
train['AMT_REQ_CREDIT_BUREAU_HOUR'].plot.hist(bins=num_unique_values)
<Axes: ylabel='Frequency'>
# Count the occurrences of each unique value in the 'AMT_REQ_CREDIT_BUREAU_HOUR' column
train.AMT_REQ_CREDIT_BUREAU_HOUR.value_counts()
AMT_REQ_CREDIT_BUREAU_HOUR
0.0    181908
1.0      1071
2.0        46
3.0         8
4.0         1
Name: count, dtype: int64
Again, almost all the values are 0, so we can ignore this column too.
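The "near-constant" judgement made by eye above can also be automated: flag any column whose single most frequent value accounts for more than a chosen share of the rows. A sketch on synthetic data (the 95% threshold is an assumption, not from the original analysis):

```python
import numpy as np
import pandas as pd

# Synthetic frame: one near-constant column, one with real variation.
df = pd.DataFrame({
    'AMT_REQ_CREDIT_BUREAU_DAY': [0.0] * 98 + [1.0, 2.0],
    'AMT_CREDIT': np.linspace(50_000, 500_000, 100),
})

# Share of rows taken by the single most frequent value in each column;
# columns above the (assumed) 95% threshold carry almost no information.
top_share = df.apply(lambda s: s.value_counts(normalize=True).iloc[0])
near_constant = top_share[top_share > 0.95].index.tolist()
print(near_constant)
```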
# Get the names of all columns in the DataFrame 'train'
train.columns.values
array(['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER',
'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN',
'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE',
'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE',
'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED',
'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'FLAG_MOBIL',
'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE',
'FLAG_PHONE', 'FLAG_EMAIL', 'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS',
'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY',
'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START',
'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION',
'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY',
'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY',
'ORGANIZATION_TYPE', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
'YEARS_BEGINEXPLUATATION_AVG', 'FLOORSMAX_AVG',
'YEARS_BEGINEXPLUATATION_MODE', 'FLOORSMAX_MODE',
'YEARS_BEGINEXPLUATATION_MEDI', 'FLOORSMAX_MEDI', 'TOTALAREA_MODE',
'EMERGENCYSTATE_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE',
'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE',
'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE',
'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4',
'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7',
'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10',
'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13',
'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16',
'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19',
'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21',
'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY',
'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON',
'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR'],
dtype=object)
Column: DAYS_BIRTH¶
This column represents the client's age in days relative to the application date, stored as negative values. For clarity and ease of understanding, we convert these values into years: divide by 365, round, and negate the result to make it positive. We will apply the same transformation to the other day-based columns.
# Display the first few rows of the 'DAYS_BIRTH' column in the DataFrame 'train'
train['DAYS_BIRTH'].head()
0    -9461
1   -16765
2   -19046
3   -19005
4   -19932
Name: DAYS_BIRTH, dtype: int64
# Convert the values in the 'DAYS_BIRTH' column from days to years and display the first few rows
train['DAYS_BIRTH'] = -round(train['DAYS_BIRTH'] / 365, 0)
train['DAYS_BIRTH'].head()
0    26.0
1    46.0
2    52.0
3    52.0
4    55.0
Name: DAYS_BIRTH, dtype: float64
# Convert the values in the 'DAYS_REGISTRATION' column from days to years and display the first few rows
train['DAYS_REGISTRATION'] = -round(train['DAYS_REGISTRATION'] / 365, 0)
train['DAYS_REGISTRATION'].head()
0    10.0
1     3.0
2    12.0
3    27.0
4    12.0
Name: DAYS_REGISTRATION, dtype: float64
# Convert the values in the 'DAYS_ID_PUBLISH' column from days to years and display the first few rows
train['DAYS_ID_PUBLISH'] = -round(train['DAYS_ID_PUBLISH'] / 365, 0)
train['DAYS_ID_PUBLISH'].head()
0    6.0
1    1.0
2    7.0
3    7.0
4    9.0
Name: DAYS_ID_PUBLISH, dtype: float64
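Since the same divide-round-negate transformation is applied to each day-based column, it can also be written once as a loop. A sketch on hypothetical values (the example frame and its numbers are illustrative, not from the dataset):

```python
import pandas as pd

# Hypothetical frame with the day-based columns converted above.
df = pd.DataFrame({
    'DAYS_BIRTH': [-9461, -16765],
    'DAYS_REGISTRATION': [-3648.0, -1186.0],
    'DAYS_ID_PUBLISH': [-2120.0, -291.0],
})

# Apply the same days-to-years conversion to every day-based column in one pass.
for col in ['DAYS_BIRTH', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH']:
    df[col] = -(df[col] / 365).round(0)

print(df)
```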
# Get the data types of each column in the DataFrame 'train'
train.dtypes
SK_ID_CURR int64
TARGET int64
NAME_CONTRACT_TYPE object
CODE_GENDER object
FLAG_OWN_CAR object
...
AMT_REQ_CREDIT_BUREAU_DAY float64
AMT_REQ_CREDIT_BUREAU_WEEK float64
AMT_REQ_CREDIT_BUREAU_MON float64
AMT_REQ_CREDIT_BUREAU_QRT float64
AMT_REQ_CREDIT_BUREAU_YEAR float64
Length: 81, dtype: object
Target¶
Total users¶
# Count the number of non-null values in the 'TARGET' column of the DataFrame 'train'
train['TARGET'].count()
np.int64(211607)
Client with payment difficulties¶
# Count the number of records where the value in the 'TARGET' column is equal to 1
(train['TARGET'] == 1).sum()
np.int64(17206)
Other clients¶
# Count the number of records where the value in the 'TARGET' column is equal to 0
(train['TARGET'] == 0).sum()
np.int64(194401)
Ratio¶
# Calculate the ratio of records with 'TARGET' value equal to 0 to records with 'TARGET' value equal to 1
(train['TARGET'] == 0).sum() / (train['TARGET'] == 1).sum()
np.float64(11.298442403812624)
The classes are heavily imbalanced: roughly 11 clients without payment difficulties for every client with them.
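The imbalance can be made explicit with `value_counts(normalize=True)`. A sketch reproducing the counts above on a synthetic target vector:

```python
import pandas as pd

# Hypothetical target vector with the same 11:1 imbalance seen above.
target = pd.Series([0] * 194401 + [1] * 17206, name='TARGET')

# Relative class frequencies make the imbalance explicit.
balance = target.value_counts(normalize=True)
ratio = (target == 0).sum() / (target == 1).sum()
print(balance)
print(round(ratio, 2))
```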
Task 5 - Analysing Categorical and Numerical Data¶
Analysing categorical data¶
Because of this data imbalance, we will split the data into records with TARGET equal to 0 and records with TARGET equal to 1, analyse them individually, and try to find any relationships that exist.
# Create a subset of the DataFrame 'train' containing records where the 'TARGET' column is equal to 0
train_0 = train.loc[train['TARGET'] == 0]
# Create a subset of the DataFrame 'train' containing records where the 'TARGET' column is equal to 1
train_1 = train.loc[train['TARGET'] == 1]
We utilized three distinct plots for analysis:¶
Pie plot:
- This displays all the values present in a column as percentages, where the sum of these percentages equals 100.
Countplot:
- Here, we depicted the count of different categories. Consequently, Target=0 will typically have a higher count than Target=1.
Barplot:
- To create this plot, we first split the dataset into two subsets by the target variable (Target=0 and Target=1), then computed the share of each category within each subset and plotted these shares side by side. Because the shares are normalized within each subset, the bars for Target=0 and Target=1 are directly comparable even though the subsets differ greatly in size. For a deeper understanding, please refer to the code of this plot.
def plotting(train, train0, train1, column):
    """
    Plots three visualizations for a given column in the dataset:
    a pie chart of the overall distribution, a countplot by category,
    and a bar plot of the percentage distribution by target variable.
    Parameters:
    - train: DataFrame containing the entire dataset.
    - train0: DataFrame filtered to TARGET == 0.
    - train1: DataFrame filtered to TARGET == 1.
    - column: The name of the column to be visualized.
    """
    # Initialize figure with a specific size
    fig = plt.figure(figsize=(13, 10))
    # Pie chart: overall distribution of the column
    ax1 = plt.subplot(221)
    train[column].value_counts().plot.pie(autopct="%1.0f%%", ax=ax1)
    plt.title('Plotting data for the column: ' + column)
    # Countplot: category counts split by TARGET
    ax2 = plt.subplot(222)
    sns.countplot(x=column, hue='TARGET', data=train, ax=ax2)
    plt.xticks(rotation=90)
    plt.title('Plotting data for target in terms of total count')
    # Bar plot: share of each category within each target group
    ax3 = plt.subplot(223)
    df = pd.DataFrame()
    df['0'] = train0[column].value_counts() / len(train0)
    df['1'] = train1[column].value_counts() / len(train1)
    df.plot.bar(ax=ax3)
    plt.title('Plotting data for target in terms of percentage')
    # Adjust layout to prevent overlap and display the plots
    fig.tight_layout()
    plt.show()
# Create a list to get all the Categorical columns
train_categorical = train.select_dtypes(include=['object']).columns
train_categorical
Index(['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY',
'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE',
'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE',
'EMERGENCYSTATE_MODE'],
dtype='object')
Task 6.1 - Univariate Analysis of the Categorical data¶
Visualizing Data Distribution¶
Take, for instance, the column "NAME_CONTRACT_TYPE". The visualizations produced offer insightful data distributions:
Pie Chart (First Plot):¶
- It reveals that 90% of the entries are 'Cash loans', while the remaining 10% are 'Revolving loans'. This chart provides a straightforward view of the overall category proportions within the column.
Count Plot (Second Plot):¶
- This visualization differentiates between 'Cash loans' and 'Revolving loans' based on the TARGET variable. For TARGET=0 (no default), 'Cash loans' far outnumber 'Revolving loans', and the plot offers the same breakdown for TARGET=1 (default), enabling a comparison of counts between the two TARGET categories.
Percentage Bar Plot (Third Plot):¶
- In this plot, the data is presented as percentages, focusing on the relative distribution within each TARGET category. For each loan type, two bars are drawn side by side (blue for TARGET=0, orange for TARGET=1), and within each TARGET group the percentages across the loan types sum to 100%. This plot emphasizes the proportionate representation of each loan type within the groups defined by the TARGET variable.
# Convert 'TARGET' to a string data type for categorical processing.
train['TARGET'] = train['TARGET'].astype(str)
# Change 'NAME_CONTRACT_TYPE' to a categorical type for optimized storage and plotting.
train['NAME_CONTRACT_TYPE'] = train['NAME_CONTRACT_TYPE'].astype('category')
# Iterate through a list of categorical column names, creating plots for each.
for column in train_categorical:
# Indicates the column currently being plotted.
print("Plotting ", column)
# Generate visualizations for the current column against the TARGET variable.
plotting(train, train_0, train_1, column)
# Prints a separator line for readability between plots of different columns.
print('----------------------------------------------------------------------------------------------')
(Output: plots were generated for each categorical column in turn: NAME_CONTRACT_TYPE, CODE_GENDER, FLAG_OWN_CAR, FLAG_OWN_REALTY, NAME_TYPE_SUITE, NAME_INCOME_TYPE, NAME_EDUCATION_TYPE, NAME_FAMILY_STATUS, NAME_HOUSING_TYPE, OCCUPATION_TYPE, WEEKDAY_APPR_PROCESS_START, ORGANIZATION_TYPE and EMERGENCYSTATE_MODE, each followed by a separator line.)
The analysis identifies several key variables that significantly influence loan behavior and default rates:¶
CODE_GENDER:¶
- Although loans are less frequently taken by males, the data indicates that the proportion of male defaulters is notably higher compared to females. This suggests gender as a potential factor in assessing loan risk.
NAME_INCOME_TYPE:¶
- Pensioners show a lower default rate, implying that despite a potentially fixed or lower income, they maintain a higher reliability in repaying loans.
NAME_EDUCATION_TYPE:¶
- Loans are predominantly sought for secondary education, followed by higher education. However, the default rate among borrowers with secondary education is substantially higher than those with higher education credentials. This points to education level as a significant determinant of loan repayment capacity.
NAME_FAMILY_STATUS:¶
- Married individuals are the most common applicants for loans and tend to have lower default rates. In contrast, singles and those in civil marriages exhibit higher default rates, highlighting the influence of marital status on loan repayment behavior.
OCCUPATION_TYPE:¶
- Laborers and various staff categories are the most frequent loan applicants, yet managers and high-skilled tech staff prove to be the most reliable in terms of repayment. This underscores the role of occupation type in predicting loan default risk.
Analyzing Numerical columns¶
Non-defaulter correlation¶
# Filter only numeric columns from train_0 for correlation analysis.
numeric_cols = train_0.select_dtypes(include=[np.number]).columns
train_0_numeric = train_0[numeric_cols]
# Handle missing values, if necessary, by filling them with the median of each column.
# train_0_numeric = train_0_numeric.fillna(train_0_numeric.median())
# Calculate the correlation matrix for the numeric columns.
corr = train_0_numeric.corr()
# Prepare a mask to hide the upper triangle of the heatmap, for readability.
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# Plotting setup: define the figure size and style.
f, ax = plt.subplots(figsize=(11, 9))
with sns.axes_style("white"):
# Create a heatmap with the correlation matrix, masked upper triangle, and specific aesthetic choices.
ax = sns.heatmap(corr, mask=mask, vmax=.3, square=True)
This method starts by narrowing down your DataFrame to only include columns with numeric data. Following this, it computes the correlation matrix using the filtered DataFrame.
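A minimal, self-contained sketch of this filter-then-correlate pattern (toy data, illustrative column names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [100.0, 200.0, 300.0, 400.0],
    "credit": [50.0, 110.0, 140.0, 210.0],
    "gender": ["M", "F", "F", "M"],   # non-numeric, must be excluded before corr()
})

# Keep only numeric columns, then compute the correlation matrix.
numeric = df.select_dtypes(include=[np.number])
corr = numeric.corr()
print(corr)   # 2x2 matrix; 'gender' was dropped
```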
Finding the top 10 correlations¶
# Select only the numeric columns from the DataFrame
numeric_df = train_0.select_dtypes(include=[np.number])
# Optionally, you can handle missing values if necessary, for example, by filling them
# numeric_df = numeric_df.fillna(method='ffill').fillna(method='bfill')
# Now calculate the absolute correlation matrix for the numeric columns
correlation_0 = numeric_df.corr().abs()
# Unstack and sort the correlation pairs
correlation_0 = correlation_0.unstack().sort_values(kind="quicksort", ascending=False)
# Optionally, drop NA values if there were any missing correlations due to missing data
correlation_0 = correlation_0.dropna()
# Display or process your sorted correlation pairs
print(correlation_0)
FLAG_EMAIL FLAG_EMAIL 1.000000
AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_QRT 1.000000
AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_MON 1.000000
DAYS_ID_PUBLISH DAYS_ID_PUBLISH 1.000000
FLOORSMAX_MEDI FLOORSMAX_MEDI 1.000000
...
FLAG_DOCUMENT_2 FLAG_MOBIL 0.000012
FLAG_DOCUMENT_9 AMT_REQ_CREDIT_BUREAU_WEEK 0.000010
AMT_REQ_CREDIT_BUREAU_WEEK FLAG_DOCUMENT_9 0.000010
FLAG_DOCUMENT_12 FLAG_MOBIL 0.000007
FLAG_MOBIL FLAG_DOCUMENT_12 0.000007
Length: 4457, dtype: float64
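The series above still contains every self-pair (always 1.0) and every pair twice, once as A/B and once as B/A. One way to keep each distinct pair exactly once is to mask everything but the strict upper triangle before unstacking; a sketch on a toy DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.1, 5.9, 8.0],
    "c": [4.0, 3.0, 2.0, 1.0],
})

corr = df.corr().abs()
# Keep only the strict upper triangle (k=1): removes self-pairs and mirror duplicates.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
top_pairs = upper.unstack().dropna().sort_values(ascending=False)
print(top_pairs.head(10))
```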
Top correlations¶
''' EXT_SOURCE_1 DAYS_BIRTH 0.601210
DAYS_EMPLOYED DAYS_BIRTH 0.618048
AMT_CREDIT AMT_ANNUITY 0.771309
AMT_GOODS_PRICE AMT_ANNUITY 0.776686
LIVE_CITY_NOT_WORK_CITY REG_CITY_NOT_WORK_CITY 0.830381
LIVE_REGION_NOT_WORK_REGION REG_REGION_NOT_WORK_REGION 0.861861
CNT_FAM_MEMBERS CNT_CHILDREN 0.878571
REGION_RATING_CLIENT_W_CITY REGION_RATING_CLIENT 0.950149
AMT_CREDIT AMT_GOODS_PRICE 0.987250
DAYS_EMPLOYED FLAG_EMP_PHONE 0.999758 '''
Top correlations for the defaulter¶
''' EXT_SOURCE_1 DAYS_BIRTH 0.570054
DAYS_EMPLOYED DAYS_BIRTH 0.575097
FLAG_EMP_PHONE DAYS_BIRTH 0.578519
AMT_CREDIT AMT_ANNUITY 0.752195
AMT_GOODS_PRICE AMT_ANNUITY 0.752699
REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY 0.778540
REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION 0.847885
CNT_CHILDREN CNT_FAM_MEMBERS 0.885484
REGION_RATING_CLIENT_W_CITY REGION_RATING_CLIENT 0.956637
AMT_CREDIT AMT_GOODS_PRICE 0.983103
DAYS_EMPLOYED FLAG_EMP_PHONE 0.999702 '''
Defaulter correlation¶
# Step 1: Select only numeric columns from the DataFrame
numeric_train_1 = train_1.select_dtypes(include=[np.number])
# Step 2: Optionally handle missing values. For example, you could fill them with zeros
# numeric_train_1.fillna(0, inplace=True)
# Now you can safely calculate the correlation matrix
corr = numeric_train_1.corr()
# If you wish to continue with masking the upper triangle for plotting:
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
f, ax = plt.subplots(figsize=(11, 9))
with sns.axes_style("white"):
ax = sns.heatmap(corr, mask=mask, vmax=.3, square=True)
# Select only numeric columns from 'train_1'
numeric_train_1 = train_1.select_dtypes(include=[np.number])
# Optional: Handle missing values. For example, you can fill them with the column mean
# numeric_train_1.fillna(numeric_train_1.mean(), inplace=True)
# Now, compute the absolute correlation matrix
correlation_1 = numeric_train_1.corr().abs()
# If you want to unstack and sort the correlation pairs
correlation_pairs = correlation_1.unstack().sort_values(kind="quicksort", ascending=False)
# Optional: Drop NaN values if any exist
correlation_pairs = correlation_pairs.dropna()
correlation_1 = correlation_1[correlation_1 != 1.0]
print(correlation_1)
(Output: a 68 × 68 absolute-correlation matrix for the defaulter subset, printed in pandas' wrapped multi-block layout. Self-correlations are masked to NaN by the `!= 1.0` filter, and the TARGET row and column are entirely NaN because TARGET is constant within this subset. Notable entries include AMT_CREDIT vs AMT_GOODS_PRICE at 0.983 and AMT_CREDIT vs AMT_ANNUITY at 0.749.)
Analysing through box plot¶
# Select columns with numerical data types ('int64' and 'float64') from the DataFrame 'train'.
# Note: despite its name, 'train_categorical' holds the NUMERICAL column names; it is reused by the loops below.
train_categorical = train.select_dtypes(include=['int64', 'float64']).columns
Task 6.2 - Univariate Analysis for numerical data¶
For univariate analysis of the numerical columns, we will plot the histogram and the distribution plot.
# Iterate over each numerical column to generate plots
for column in train_categorical:
# Construct and print the title for current plot
title = "Plot of " + column
print(title)
# Plot histograms for the distribution of the variable for both TARGET categories
plt.hist(train_0[column], alpha=0.5, label='Target=0') # Histogram for category '0'
plt.hist(train_1[column], alpha=0.5, label='Target=1') # Histogram for category '1'
plt.legend() # Add a legend to distinguish between categories
plt.show() # Display the histogram
# Plot normalized histograms for the non-null values in both TARGET categories
# (sns.distplot is deprecated in recent seaborn; histplot with stat='density' is the modern equivalent)
sns.histplot(train_0[column].dropna(), label='Target=0', stat='density', alpha=0.5) # Distribution plot for '0'
sns.histplot(train_1[column].dropna(), label='Target=1', stat='density', alpha=0.5) # Distribution plot for '1'
plt.legend() # Add a legend to distinguish between categories
plt.show() # Display the distribution plot
# Placeholder for a box plot function that might be defined elsewhere
# box_plot(train_0, train_1, column)
# Print a separator for readability between plots
print("------------------------------------------------------------------------")
(Output: a histogram and a distribution plot were generated for each of the roughly 68 numerical columns, from SK_ID_CURR through AMT_REQ_CREDIT_BUREAU_YEAR, each followed by a separator line.)
The columns whose histogram and distribution plot both differ between Target = 0 and Target = 1 are:
- AMT_CREDIT
- AMT_ANNUITY
- AMT_GOODS_PRICE
- DAYS_BIRTH
- HOUR_APPR_PROCESS_START
- EXT_SOURCE_2
- EXT_SOURCE_3
- AMT_REQ_CREDIT_BUREAU_YEAR
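Visual differences between the two groups can also be quantified. A hand-rolled sketch of the two-sample Kolmogorov-Smirnov statistic (the maximum gap between the two empirical CDFs; in practice `scipy.stats.ks_2samp` computes this and also returns a p-value) on synthetic stand-ins for the two TARGET groups:

```python
import numpy as np

def ks_statistic(x, y):
    """Maximum distance between the empirical CDFs of x and y."""
    data = np.sort(np.concatenate([x, y]))
    cdf_x = np.searchsorted(np.sort(x), data, side="right") / len(x)
    cdf_y = np.searchsorted(np.sort(y), data, side="right") / len(y)
    return np.max(np.abs(cdf_x - cdf_y))

rng = np.random.default_rng(0)
sample_0 = rng.normal(loc=0.0, scale=1.0, size=5000)   # stand-in for Target=0
sample_1 = rng.normal(loc=0.5, scale=1.0, size=5000)   # stand-in for Target=1, shifted

stat = ks_statistic(sample_0, sample_1)
print(f"KS statistic = {stat:.3f}")   # larger means more dissimilar distributions
```

In the real analysis, `train_0[col]` and `train_1[col]` (after `dropna()`) would take the place of the synthetic samples.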
Reading previous application¶
# Read the data from the file 'previous_application.csv' into a DataFrame
previous_application = pd.read_csv('previous_application.csv')
# Display the first few rows of the DataFrame 'previous_application'
previous_application.head()
| SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | ... | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2030495 | 271877 | Consumer loans | 1730.430 | 17145.0 | 17145.0 | 0.0 | 17145.0 | SATURDAY | 15 | ... | Connectivity | 12.0 | middle | POS mobile with interest | 365243.0 | -42.0 | 300.0 | -42.0 | -37.0 | 0.0 |
| 1 | 2802425 | 108129 | Cash loans | 25188.615 | 607500.0 | 679671.0 | NaN | 607500.0 | THURSDAY | 11 | ... | XNA | 36.0 | low_action | Cash X-Sell: low | 365243.0 | -134.0 | 916.0 | 365243.0 | 365243.0 | 1.0 |
| 2 | 2523466 | 122040 | Cash loans | 15060.735 | 112500.0 | 136444.5 | NaN | 112500.0 | TUESDAY | 11 | ... | XNA | 12.0 | high | Cash X-Sell: high | 365243.0 | -271.0 | 59.0 | 365243.0 | 365243.0 | 1.0 |
| 3 | 2819243 | 176158 | Cash loans | 47041.335 | 450000.0 | 470790.0 | NaN | 450000.0 | MONDAY | 7 | ... | XNA | 12.0 | middle | Cash X-Sell: middle | 365243.0 | -482.0 | -152.0 | -182.0 | -177.0 | 1.0 |
| 4 | 1784265 | 202054 | Cash loans | 31924.395 | 337500.0 | 404055.0 | NaN | 337500.0 | THURSDAY | 9 | ... | XNA | 24.0 | high | Cash Street: high | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 37 columns
# Shape of previous application
previous_application.shape
(1670214, 37)
There are duplicate 'SK_ID_CURR' values, as a person could have taken a loan multiple times¶
# Number of unique id in previous application
previous_application.SK_ID_PREV.value_counts()
SK_ID_PREV
2418762 1
2030495 1
2802425 1
2523466 1
2819243 1
..
2579447 1
1715995 1
2257824 1
2330894 1
1397919 1
Name: count, Length: 1670214, dtype: int64
# Number of unique id in previous application
previous_application.SK_ID_CURR.value_counts()
SK_ID_CURR
187868 77
265681 73
173680 72
242412 68
206783 67
..
123460 1
353874 1
302277 1
316831 1
363167 1
Name: count, Length: 338857, dtype: int64
From the information provided, the shape of the previous application dataset is (1670214, 37), and the count of unique SK_ID_PREV entries matches this at 1670214, so SK_ID_PREV is a unique key. The count of unique SK_ID_CURR entries, however, is significantly lower at 338857. This indicates that SK_ID_CURR contains duplicates: many previous applications belong to the same current client.
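This uniqueness structure can be checked directly; a sketch on a toy frame mirroring the two ID columns:

```python
import pandas as pd

prev = pd.DataFrame({
    "SK_ID_PREV": [10, 11, 12, 13],   # one row per previous application
    "SK_ID_CURR": [1, 1, 2, 3],       # client 1 has two previous applications
})

assert prev["SK_ID_PREV"].is_unique       # unique key for this table
assert not prev["SK_ID_CURR"].is_unique   # clients can repeat
print(prev["SK_ID_CURR"].value_counts())
```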
Task 7 - Analysis for the outliers¶
Plotting the numerical data against the index and analysing whether any of the columns contain outliers.
# Plot each numerical column against the index of the DataFrame 'train'
for column in train_categorical:
title = "Plot of " + column
plt.scatter(train.index, train[column])
plt.title(title)
plt.show()
The examination of outliers primarily hinges on the specifics of the business problem. Assuming the data is error-free, the decision to exclude any extreme values is dictated by business requirements. A review of various plots indicates the data reporting appears accurate. Hence, rather than eliminating these points, we will highlight the columns containing extreme values for further analysis.
Columns warranting further investigation include:
CNT_CHILDREN:
- This denotes the number of children a client has. Instances of clients with more than 10 children are considered extreme.
AMT_INCOME_TOTAL:
- This represents the client's income. A particular value around 1.2e8 is notably distant from the rest, suggesting it may be an outlier suitable for removal.
FLAG_MOBIL:
- This identifies whether a client possesses a mobile phone. Since this is a binary flag, the rare value (clients without a mobile phone) may or may not be treated as an outlier. Other columns such as OBS_30_CNT_SOCIAL_CIRCLE, DEF_30_CNT_SOCIAL_CIRCLE, and OBS_60_CNT_SOCIAL_CIRCLE also merit attention.
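One common way to flag, rather than drop, extreme values is the 1.5 × IQR rule; a sketch assuming a numeric Series such as `train['CNT_CHILDREN']`:

```python
import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

children = pd.Series([0, 0, 1, 1, 2, 2, 3, 19])   # 19 is an extreme value
mask = iqr_outliers(children)
print(children[mask])   # flags the 19
```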
Converting numerical data to categorical for analysis¶
# Plot the distribution of 'AMT_ANNUITY' after removing any missing values
# (sns.distplot is deprecated in recent seaborn; histplot with stat='density' is the modern equivalent)
sns.histplot(train['AMT_ANNUITY'].dropna(), stat='density', kde=True)
<Axes: xlabel='AMT_ANNUITY', ylabel='Density'>
def amt_annuity(x):
    if x <= 20000:
        return 'low'
    elif x <= 50000:
        return 'medium'
    elif x <= 100000:
        return 'high'
    else:
        return 'very high'
# Apply the 'amt_annuity' function to create a new categorical column 'amt_annuity_categorical'
train['amt_annuity_categorical'] = train['AMT_ANNUITY'].apply(amt_annuity)
# Plot a count plot of the 'amt_annuity_categorical' column
sns.countplot(x='amt_annuity_categorical', data=train)
<Axes: xlabel='amt_annuity_categorical', ylabel='count'>
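The same binning can be written more idiomatically with `pandas.cut`. Note one behavioral difference: with `pd.cut`, missing values stay NaN, whereas the `apply`-based function above silently places NaN into 'very high' (every comparison with NaN is False, so the `else` branch fires). A sketch on toy values:

```python
import numpy as np
import pandas as pd

annuity = pd.Series([15000.0, 35000.0, 75000.0, 250000.0, np.nan])
bins = [-np.inf, 20000, 50000, 100000, np.inf]   # right-closed bins, matching x <= threshold
labels = ["low", "medium", "high", "very high"]

categories = pd.cut(annuity, bins=bins, labels=labels)
print(categories)   # the NaN input stays NaN
```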
Task 8 - Merging Datasets¶
Merging DataFrames: Train and Previous Application based on SK_ID_CURR¶
After merging the two dataframes on the SK_ID_CURR column, the resulting dataframe will contain duplicate SK_ID_CURR values. This duplication is not an issue, as our objective is to explore patterns, including cases where a borrower has taken out more than one previous loan. Retaining these duplicates lets us analyze the data comprehensively and identify recurring trends among borrowers with multiple loan histories.
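The row multiplication an inner merge performs on duplicate keys can be seen on toy frames (column names mirror the real datasets; the values are made up):

```python
import pandas as pd

train_toy = pd.DataFrame({"SK_ID_CURR": [100002, 100003],
                          "TARGET": [1, 0]})
prev_toy = pd.DataFrame({"SK_ID_CURR": [100003, 100003, 100002],
                         "SK_ID_PREV": [1, 2, 3]})

merged = train_toy.merge(prev_toy, on="SK_ID_CURR", how="inner")
# 100003 appears twice in prev_toy, so its train row is duplicated in the result.
print(merged)
```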
# Merge 'train' DataFrame with 'previous_application' DataFrame based on 'SK_ID_CURR'
# Using 'inner' join to retain only common rows between the two DataFrames
previous_train = train.merge(previous_application, on='SK_ID_CURR', how='inner')
previous_train.head()
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE_x | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT_x | AMT_ANNUITY_x | ... | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | Auto technology | 24.0 | low_normal | POS other with interest | 365243.0 | -565.0 | 125.0 | -25.0 | -17.0 | 0.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | XNA | 12.0 | low_normal | Cash X-Sell: low | 365243.0 | -716.0 | -386.0 | -536.0 | -527.0 | 1.0 |
| 2 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | Furniture | 6.0 | middle | POS industry with interest | 365243.0 | -797.0 | -647.0 | -647.0 | -639.0 | 0.0 |
| 3 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | Consumer electronics | 12.0 | middle | POS household with interest | 365243.0 | -2310.0 | -1980.0 | -1980.0 | -1976.0 | 1.0 |
| 4 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | Connectivity | 4.0 | middle | POS mobile without interest | 365243.0 | -784.0 | -694.0 | -724.0 | -714.0 | 0.0 |
5 rows × 118 columns
# Get the names of all columns in the DataFrame 'previous_application'
previous_application.columns.values
array(['SK_ID_PREV', 'SK_ID_CURR', 'NAME_CONTRACT_TYPE', 'AMT_ANNUITY',
'AMT_APPLICATION', 'AMT_CREDIT', 'AMT_DOWN_PAYMENT',
'AMT_GOODS_PRICE', 'WEEKDAY_APPR_PROCESS_START',
'HOUR_APPR_PROCESS_START', 'FLAG_LAST_APPL_PER_CONTRACT',
'NFLAG_LAST_APPL_IN_DAY', 'RATE_DOWN_PAYMENT',
'RATE_INTEREST_PRIMARY', 'RATE_INTEREST_PRIVILEGED',
'NAME_CASH_LOAN_PURPOSE', 'NAME_CONTRACT_STATUS', 'DAYS_DECISION',
'NAME_PAYMENT_TYPE', 'CODE_REJECT_REASON', 'NAME_TYPE_SUITE',
'NAME_CLIENT_TYPE', 'NAME_GOODS_CATEGORY', 'NAME_PORTFOLIO',
'NAME_PRODUCT_TYPE', 'CHANNEL_TYPE', 'SELLERPLACE_AREA',
'NAME_SELLER_INDUSTRY', 'CNT_PAYMENT', 'NAME_YIELD_GROUP',
'PRODUCT_COMBINATION', 'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE',
'DAYS_LAST_DUE_1ST_VERSION', 'DAYS_LAST_DUE', 'DAYS_TERMINATION',
'NFLAG_INSURED_ON_APPROVAL'], dtype=object)
The merged dataframe also has multiple values for SK_ID_CURR¶
# Count the occurrences of each unique value in the 'SK_ID_CURR' column of the DataFrame 'previous_application' and display the top values
previous_application['SK_ID_CURR'].value_counts().head()
SK_ID_CURR
187868    77
265681    73
173680    72
242412    68
206783    67
Name: count, dtype: int64
Segregating the dataset on Target = 0 and Target = 1¶
# Create a subset of the DataFrame 'train' containing records where the 'TARGET' column is equal to '0'
train_0 = train.loc[train['TARGET'] == '0']
# Create a subset of the DataFrame 'train' containing records where the 'TARGET' column is equal to '1'
train_1 = train.loc[train['TARGET'] == '1']
# Create a subset of the DataFrame 'previous_train' containing records where the 'TARGET' column is equal to '0'
ptrain_0 = previous_train.loc[previous_train['TARGET'] == '0']
# Create a subset of the DataFrame 'previous_train' containing records where the 'TARGET' column is equal to '1'
ptrain_1 = previous_train.loc[previous_train['TARGET'] == '1']
Plotting data¶
def plotting(column, hue):
# Assign the column parameter to a short local alias
col = column
# Create a figure for the plots with a specific size
fig = plt.figure(figsize=(13,10))
# Subplot 1: Pie chart showing the distribution of values in the column
ax1 = plt.subplot(221)
train[col].value_counts().plot.pie(autopct="%1.0f%%", ax=ax1)
plt.title('Distribution of values for the column: '+ column)
# Subplot 2: Bar plot displaying the distribution of values by target categories
ax2 = plt.subplot(222)
df = pd.DataFrame()
df['0'] = ((train_0[col].value_counts()) / len(train_0))
df['1'] = ((train_1[col].value_counts()) / len(train_1))
df.plot.bar(ax=ax2)
plt.title('Distribution of values by target category')
# Subplot 3: Count plot showing the distribution of values for Target=0
ax3 = plt.subplot(223)
sns.countplot(x=col, hue=hue, data=ptrain_0, ax=ax3)
plt.xticks(rotation=90)
plt.title('Distribution of values for Target=0')
# Subplot 4: Count plot showing the distribution of values for Target=1
ax4 = plt.subplot(224)
sns.countplot(x=col, hue=hue, data=ptrain_1, ax=ax4)
plt.xticks(rotation=90)
plt.title('Distribution of values for Target=1')
# Adjust layout to prevent overlap
fig.tight_layout()
# Display the plots
plt.show()
Task 9 - Bivariate Analysis¶
Here, we plot only three columns, as looping over all columns is computationally heavy for a dataset of this size.
Plotting NAME_EDUCATION_TYPE¶
plotting('NAME_EDUCATION_TYPE','NAME_CONTRACT_STATUS')
Observation: Applicants with 'Secondary special' education take out the most loans, and their loans are also most often approved.
Plotting NAME_FAMILY_STATUS¶
plotting('NAME_FAMILY_STATUS','NAME_CONTRACT_STATUS')
Observation: Married people are more likely to pay back their loans on time compared to single people, as indicated by the differences in loan approval statuses.
Plotting: NAME_HOUSING_TYPE¶
plotting('NAME_HOUSING_TYPE','NAME_CONTRACT_STATUS')
Observation: There's a distinct variation among the approval statuses ("Approved, Refused, Unused, and Cancelled") concerning the "House/apartment" category.
Plotting: ORGANIZATION_TYPE¶
# Plot a pie chart to visualize the distribution of values in the 'ORGANIZATION_TYPE' column
col = 'ORGANIZATION_TYPE'
train[col].value_counts().plot.pie(autopct="%1.0f%%")
plt.title("Distribution of " + col)
plt.show()
value_counts = train[col].value_counts(normalize=True) * 100 # Calculate percentage of each value count
# Keep only the categories that account for more than 2 percent
value_counts_filtered = value_counts[value_counts > 2]
# Plot pie chart only if there are values with more than 2 percent count
if not value_counts_filtered.empty:
    value_counts_filtered.plot.pie(autopct="%1.0f%%")
    plt.title(f"Distribution of {col} (more than 2%)")
    plt.show()
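Filtering out the small slices makes the pie readable, but it also silently drops those applicants. An alternative sketch (using a hypothetical `lump_small_categories` helper, not from the notebook) folds everything at or below the threshold into a single 'Other' slice, so the percentages still sum to 100:

```python
import pandas as pd

def lump_small_categories(counts: pd.Series, threshold: float = 2.0) -> pd.Series:
    """Convert raw counts to percentages, folding categories at or
    below `threshold` percent into a single 'Other' slice."""
    pct = counts / counts.sum() * 100
    major = pct[pct > threshold]
    minor_total = pct[pct <= threshold].sum()
    if minor_total > 0:
        major = pd.concat([major, pd.Series({'Other': minor_total})])
    return major

# Toy counts standing in for train['ORGANIZATION_TYPE'].value_counts()
toy = pd.Series({'Business Entity Type 3': 60, 'Self-employed': 30,
                 'Realtor': 9, 'Mobile': 1})
slices = lump_small_categories(toy)
# slices.plot.pie(autopct="%1.0f%%") now places every applicant in some slice
```

The rendered pie then accounts for the whole sample while still hiding the long tail of tiny organization types behind one slice.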
# Define the column of interest
col = 'ORGANIZATION_TYPE'
# Create an empty DataFrame to hold the proportions of each category for each target group
df = pd.DataFrame()
# Calculate the proportions of each category for Target=0 and Target=1 and store them in the DataFrame
df['0'] = train_0[col].value_counts(normalize=True) # Proportions for Target=0
df['1'] = train_1[col].value_counts(normalize=True) # Proportions for Target=1
# Set the figure size for the bar plot
sns.set(rc={'figure.figsize':(15,5)})
# Plot a bar plot showing the proportions of each category for both Target=0 and Target=1
df.plot.bar()
# Display the plot
plt.show()
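Comparing the two proportion bars by eye is indirect; a more direct check of the observation at the end of this task is the default rate per organization type, i.e. the mean of the 0/1 `TARGET` column within each category. A sketch on a toy frame (the real computation would run on `train` itself):

```python
import pandas as pd

# Toy stand-in for the real `train` frame (TARGET: 1 = payment difficulties)
toy = pd.DataFrame({
    'ORGANIZATION_TYPE': ['Business Entity Type 3', 'School',
                          'Business Entity Type 3', 'School',
                          'Self-employed', 'Self-employed'],
    'TARGET': [1, 0, 1, 0, 1, 0],
})

# The mean of a 0/1 column within each group is exactly that group's default rate
default_rate = (toy.groupby('ORGANIZATION_TYPE')['TARGET']
                   .mean()
                   .sort_values(ascending=False))
```

Sorting the resulting Series immediately ranks the organization types by risk, which is easier to read off than two side-by-side count plots.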
# Define the column of interest
col = 'ORGANIZATION_TYPE'
# Define the hue column for differentiation
hue = 'NAME_CONTRACT_STATUS'
# Create a count plot showing the distribution of values in 'col' column, differentiated by 'hue' column, for Target=0
sns.countplot(x=col, hue=hue, data=ptrain_0)
# Rotate the x-axis labels for better readability
plt.xticks(rotation=90)
# Display the plot
plt.show()
# Define the column of interest
col = 'ORGANIZATION_TYPE'
# Define the hue column for differentiation
hue = 'NAME_CONTRACT_STATUS'
# Create a count plot showing the distribution of values in 'col' column, differentiated by 'hue' column, for Target=1
sns.countplot(x=col, hue=hue, data=ptrain_1)
# Rotate the x-axis labels for better readability
plt.xticks(rotation=90)
# Display the plot
plt.show()
This column appears to be the most informative. Individuals associated with "Business Entity Type 3" and the self-employed show the highest default rates. Overall, the univariate analysis of this dataset yielded more insight than the bivariate analysis.
Conclusion¶
Through the series of tasks performed, we gained valuable insights into the loan applicant dataset. We successfully identified missing values, explored the distribution of variables, analyzed outliers, and examined relationships between different attributes.
This analysis provides a foundational understanding of the dataset and its potential implications for loan management and risk assessment. These insights can guide decision-making processes, such as developing predictive models or refining loan approval criteria, ultimately enhancing the effectiveness of lending practices and minimizing default risks.